Lecture 6

Hypothesis testing

Prior to this lecture, you should have read chapter 4 of Regression and Other Stories.

Confidence intervals for means and proportions

I have a data set of commute times based on a random sample of 100 California households. Based on that sample, I can calculate an average commute time.

mean(commutes_100a$TRANTIME)
## [1] 31.1

But if I had sampled a different set of 100 households from the same population, I could have gotten a slightly different average.

mean(commutes_100b$TRANTIME)
## [1] 28.21
mean(commutes_100c$TRANTIME)
## [1] 27.01
mean(commutes_100d$TRANTIME)
## [1] 31.99

All of these averages will tend to be clustered around the actual population average, even if none of them will be exactly equal to the population average.

A one-sample t-test uses the mean and standard deviation of a sample to calculate a confidence interval for the population mean: a range of values that the real average of the population probably falls within.

Here is how you would get a 90-percent confidence interval for the average commute time in R.

t.test(commutes_100a$TRANTIME, conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  commutes_100a$TRANTIME
## t = 11.863, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  26.7473 35.4527
## sample estimates:
## mean of x 
##      31.1

Look at the two values listed under 90 percent confidence interval:. You can interpret them to mean that you can be 90 percent confident that the average commute time for the full population is between 26.7 and 35.5 minutes.

You can also calculate a confidence interval for the population mean in Excel, but you’ll need to do it in a couple of steps.

First, you would calculate the standard deviation and average of your sample data. Then, use the CONFIDENCE.T() function to calculate the margin for the confidence interval using three arguments: alpha (one minus the confidence level), the standard deviation, and the sample size (in the example below, I use the COUNT() function to get the sample size). The confidence interval is the sample mean, plus or minus this value.
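To see what those Excel steps are doing, here is the same calculation done by hand in R, using the commutes_100a sample from the examples above. The margin is the t critical value times the standard error, and the interval is the sample mean plus or minus that margin.

```r
x <- commutes_100a$TRANTIME

# For a 90-percent interval, alpha is 0.1, so we want the
# 95th percentile of the t distribution with n - 1 degrees of freedom
margin <- qt(0.95, df = length(x) - 1) * sd(x) / sqrt(length(x))

mean(x) - margin  # lower bound of the 90-percent interval
mean(x) + margin  # upper bound
```

These two bounds match the interval reported by t.test() above.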

What influences the width of the confidence interval?

Three things influence the width of a confidence interval: the size of the sample (larger samples give narrower intervals), the amount of variation in the data (a larger standard deviation gives a wider interval), and the confidence level (higher confidence requires a wider interval).
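You can see the effect of each factor with a quick sketch. The function below (my own illustration, not part of any package) computes the margin for a mean, using a made-up standard deviation of 25 minutes as the baseline.

```r
# Margin of error for a confidence interval for a mean
margin <- function(sd, n, conf.level) {
  qt(1 - (1 - conf.level) / 2, df = n - 1) * sd / sqrt(n)
}

margin(sd = 25, n = 100, conf.level = 0.90)  # baseline
margin(sd = 25, n = 400, conf.level = 0.90)  # bigger sample: narrower
margin(sd = 50, n = 100, conf.level = 0.90)  # more variation: wider
margin(sd = 25, n = 100, conf.level = 0.99)  # more confidence: wider
```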

Confidence intervals for proportions

You can also use a one-sample t-test to calculate the confidence interval for the proportion of the population that falls into a category. Here is how I would find the 90 percent confidence interval for the proportion of the population that commutes by car.

t.test(commutes_100a$mode == "Car", conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  commutes_100a$mode == "Car"
## t = 28.302, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  0.8377863 0.9422137
## sample estimates:
## mean of x 
##      0.89

This result means I can be 90 percent confident that the share of the full population that commutes by car is between 84 percent and 94 percent.
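Running a t-test on a TRUE/FALSE variable like this gives an interval very close to the textbook normal-approximation interval for a proportion, p plus or minus z times the standard error of the proportion. Here is a sketch of that calculation, plugging in the sample proportion from the output above.

```r
p <- 0.89   # sample proportion who commute by car
n <- 100    # sample size

z <- qnorm(0.95)  # critical value for a 90-percent interval

# p plus or minus z times sqrt(p * (1 - p) / n)
p + c(-1, 1) * z * sqrt(p * (1 - p) / n)
```

The result is within about half a percentage point of the t-test interval.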

I can get a similar result in Excel by applying the same CONFIDENCE.T() approach to a column of ones (car commuters) and zeros (everyone else).

Average values within categories

You can use group_by() and get_summary_stats() in R to produce a table that gives an average value within each category, along with the 95-percent confidence interval for each average.

library(rstatix)

income_by_mode <- commuting %>%
  group_by(mode) %>%
  get_summary_stats(INCTOT, type = "mean_ci") %>%
  mutate(ci_low = mean - ci,
         ci_hi = mean + ci)

income_by_mode  %>%
  kable(digits = c(rep(0, 15), 3, 0)) %>%
  scroll_box(width = "75%")
mode variable n mean ci ci_low ci_hi
Bike INCTOT 8390 73987 2049 71938 76037
Car INCTOT 715885 68652 198 68454 68850
Other INCTOT 13682 69853 1633 68220 71487
Transit INCTOT 42950 70218 834 69384 71053
Walk INCTOT 23893 47564 963 46601 48528

In the table above, the 95-percent confidence interval for the average income of those who bike to work is $71,938 to $76,037. In other words, we can be 95-percent confident that the average income for all cyclists in the full population is within that range.

Error bars can be a helpful way to visualize these confidence intervals.

income_breaks <- seq(0, 90000, by = 10000)

ggplot(income_by_mode) +
  geom_col(aes(x = mode, y = mean)) +
  geom_errorbar(aes(x = mode,
                    ymin = ci_low,
                    ymax = ci_hi),
                width = 0.2) +
  scale_y_continuous(name = "Average income",
                     breaks = income_breaks,
                     labels = paste0("$", 
                                     prettyNum(income_breaks, 
                                               big.mark = ","))) +
  scale_x_discrete(name = "Usual mode of travel to work") +
  theme_minimal()

Comparing means across categories

If our estimate of the population average within each category is a range rather than a single number, how do we compare the averages between two groups?

A two-sample t-test can tell us if there is a statistically significant difference in the averages between two categories.

A statistically significant difference means we can have an acceptable level of confidence (usually 95 percent confidence) that the two averages are not the same.
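Before comparing all possible pairs at once, it can help to see a single comparison. Here is a sketch of a two-sample t-test with base R's t.test(), comparing incomes for two of the modes in the commuting data set used in the examples above.

```r
# Incomes for two commute-mode groups
bike <- commuting$INCTOT[commuting$mode == "Bike"]
walk <- commuting$INCTOT[commuting$mode == "Walk"]

# A small p-value means a statistically significant difference
# in average income between the two groups
t.test(bike, walk)
```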

Here is how you would calculate the difference in average income between all possible pairs of mode categories.

library(rstatix)

comp_income_by_mode <- commuting %>%
  t_test(INCTOT ~ mode, detailed = TRUE, conf.level = 0.9)

comp_income_by_mode  %>%
  kable(digits = c(rep(0, 15), 3, 0)) %>%
  scroll_box(width = "75%")
estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic p df conf.low conf.high method alternative p.adj p.adj.signif
5336 73987 68652 INCTOT Bike Car 8390 715885 5 0 8546 3608 7063 T-test two.sided 0.000 ****
4134 73987 69853 INCTOT Bike Other 8390 13682 3 0 17986 1935 6333 T-test two.sided 0.006 **
3769 73987 70218 INCTOT Bike Transit 8390 42950 3 0 11339 1912 5626 T-test two.sided 0.003 **
26423 73987 47564 INCTOT Bike Walk 8390 23893 23 0 12298 24523 28323 T-test two.sided 0.000 ****
-1201 68652 69853 INCTOT Car Other 715885 13682 -1 0 14085 -2582 180 T-test two.sided 0.304 ns
-1567 68652 70218 INCTOT Car Transit 715885 42950 -4 0 47906 -2286 -847 T-test two.sided 0.002 **
21087 68652 47564 INCTOT Car Walk 715885 23893 42 0 25947 20262 21913 T-test two.sided 0.000 ****
-365 69853 70218 INCTOT Other Transit 13682 42950 0 1 21286 -1904 1174 T-test two.sided 0.696 ns
22289 69853 47564 INCTOT Other Walk 13682 23893 23 0 23245 20697 23880 T-test two.sided 0.000 ****
22654 70218 47564 INCTOT Transit Walk 42950 23893 35 0 55719 21585 23723 T-test two.sided 0.000 ****

Correlations

The correlation between two continuous variables is a measure of how closely their scatter plot resembles a straight line or how well the value of one variable can predict the value of the other. Correlations can range from negative 1 (a downward-sloping straight line) to positive 1 (an upward-sloping straight line).

A correlation of zero means there is no (linear) relationship between the two variables.
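Under the hood, a correlation is just the covariance of the two variables scaled by their two standard deviations. Here is a sketch with simulated data showing that the scaled covariance matches what cor() returns.

```r
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)  # y is partly predicted by x

cov(x, y) / (sd(x) * sd(y))  # same value as cor(x, y)
cor(x, y)
```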

Remember that a variable with a log-normal distribution will have a lot of small values that are close together, and a few more spread-out larger values.

Here is a scatter plot of two log-normally distributed variables.

ggplot(commutes_5000) +
  geom_point(aes(x = INCTOT, y = TRANTIME),
             size = 0.1) +
  theme_minimal()

And here is the same set of variables with the x- and y-axes on a log scale.

ggplot(commutes_5000) +
  geom_point(aes(x = INCTOT, y = TRANTIME),
             size = 0.1) +
  scale_x_continuous(trans = "log") +
  scale_y_continuous(trans = "log") +
  theme_minimal()

You’ll find that the correlation between the two variables is less than the correlation between the logs of the two variables.

cor(commutes_5000$INCTOT, 
    commutes_5000$TRANTIME)
## [1] 0.08644618
cor(log(commutes_5000$INCTOT), 
         log(commutes_5000$TRANTIME))
## [1] 0.1613641

This means that there is a relationship between these two variables, but it isn’t a linear relationship.

Here’s a simpler (and more extreme) example. There is clearly a strong relationship between these two variables.

ggplot(square) +
  geom_point(aes(x = X, y = Y)) +
  theme_minimal()

But the correlation between X and Y in the plot above is zero.

cor(square$X, square$Y)
## [1] 0

I can transform X by squaring it.

ggplot(square) +
  geom_point(aes(x = X^2, y = Y)) +
  theme_minimal()

The correlation between X and Y was zero, but the correlation between the square of X and Y is 1.

cor(square$X^2,
    square$Y)
## [1] 1

Confidence intervals for correlations

Just because there is a non-zero correlation between two variables in our sample, that doesn’t mean there would be a non-zero correlation between those variables in the full population. We can also calculate a confidence interval for a correlation.

cor.test(log(commutes_5000$INCTOT), 
         log(commutes_5000$TRANTIME))
## 
##  Pearson's product-moment correlation
## 
## data:  log(commutes_5000$INCTOT) and log(commutes_5000$TRANTIME)
## t = 11.557, df = 4996, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1342399 0.1882468
## sample estimates:
##       cor 
## 0.1613641